-
Notifications
You must be signed in to change notification settings - Fork 7.8k
Add grapheme_levenshtein function. #18087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This RFC was accepted. I will go to implementation. |
Nice, could you possibly rebase with last master state to redo the CI ? cheers. |
Measure levenshtein for grapheme cluster unit
dd9a015
to
5fc7475
Compare
I rebased and force pushed. I'm waiting for CI end. |
should be fine, thanks. Looking good but I ll go back at it in couple of hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM minus the remark
if (ustring1) { | ||
efree(ustring1); | ||
} | ||
RETURN_FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that false is being returned on failure, rather than throwing an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, Userland can uses intl_get_error_message
function after this function if this error. Therefore, this block is returns false.
Hmm... Windows CI is failed. |
ext/intl/grapheme/grapheme_string.c
Outdated
/* Set global error code. */ | ||
intl_error_set_code(NULL, ustatus1); | ||
|
||
/* Set error messages. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These kind of comments are pointless, this just is the english version of the code, please avoid these kinds of comments.
ext/intl/grapheme/grapheme_string.c
Outdated
/* Set error messages. */ | ||
intl_error_set_custom_msg(NULL, "Error converting input string to UTF-16", 0); | ||
if (ustring1) { | ||
efree(ustring1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to NULL check this.
ext/intl/grapheme/grapheme_string.c
Outdated
/* Set error messages. */ | ||
intl_error_set_custom_msg(NULL, "Error converting input string to UTF-16", 0); | ||
if (ustring2) { | ||
efree(ustring2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to NULL check this (and same below).
ext/intl/grapheme/grapheme_string.c
Outdated
/* Set global error code. */ | ||
intl_error_set_code(NULL, ustatus2); | ||
|
||
/* Set error messages. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same pointless comment.
ext/intl/grapheme/grapheme_string.c
Outdated
intl_convert_utf8_to_utf16(&ustring2, &ustring2_len, pstr2, ZSTR_LEN(string2), &ustatus2); | ||
|
||
if (U_FAILURE(ustatus2)) { | ||
/* Set global error code. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same pointless comment.
ext/intl/grapheme/grapheme_string.c
Outdated
intl_convert_utf8_to_utf16(&ustring1, &ustring1_len, pstr1, ZSTR_LEN(string1), &ustatus1); | ||
|
||
if (U_FAILURE(ustatus1)) { | ||
/* Set global error code. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same pointless comment.
|
||
unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the error checks for these calls and the next ones?
ext/intl/grapheme/grapheme_string.c
Outdated
int32_t pos1 = 0; | ||
int32_t pos2 = 0; | ||
int32_t usrch_pos = 0; | ||
while (pos1 != UBRK_DONE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you should change this to while (true)
because otherwise it's confusing which check actually matters
RETURN_THROWS(); | ||
} | ||
|
||
zend_long *p1, *p2, *tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where possible, I think it's better to merge the declaration and assignment as it reduces the area/scope where something is valid/can be used.
ext/intl/grapheme/grapheme_string.c
Outdated
if (pos2 == UBRK_DONE) { | ||
break; | ||
} | ||
usrch_pos = grapheme_strpos_utf16(pstr1 + current1, pos1 - current1, pstr2 + current2, pos2 - current2, 0, NULL, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function call will make the conversion from UTF-8 to UTF-16 again and it seems very inefficient to do it this way. This is also the equivalent of strrpos
which doesn't seem to make sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed.
However, usearch.h
requires UTF-16 because UChar
is UTF-16.
Maybe, possible to use u_uastrcpy
to usearch_open
.
Please give me a time. I'll try it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unified internal encoding is UTF-16.
Measure levenshtein for grapheme cluster unit.
RFC: https://wiki.php.net/rfc/grapheme_levenshtein